Sains Malaysiana 54(11)(2025): 2773-2783
http://doi.org/10.17576/jsm-2025-5411-16
Determination of the Optimal Number of PLS Components Based on the Combination of Cross-Validation and RMD-MRCD-PCA Weighting Function
(Penentuan Bilangan Komponen PLS yang Optimum berasaskan Gabungan Pengesahan Silang dan Fungsi Pemberat RMD-MRCD-PCA)
HABSHAH MIDI1, SITI ZAHARIAH ABDUL WAHAB2,* & AZREE SHAHREL AHMAD NAZRI1
1Institute for Mathematical Research, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia
2Malaysian Institute of Information Technology, Universiti Kuala Lumpur, 50250 Kuala Lumpur, Malaysia
Received: 19 November 2024/Accepted: 24 October 2025
ABSTRACT
Partial least squares (PLS) regression is a very useful tool for the analysis of high-dimensional data (HDD). Choosing the number of PLS components is a vital step in developing the best model, because selecting too many or too few components reduces the accuracy of the model. Classical methods such as leave-one-out cross-validation (LOOCV) and K-fold cross-validation (K-FoldCV) have been developed to determine the optimal number of PLS components. Nonetheless, they are easily affected by high leverage points (HLPs). To remedy this problem, robust cross-validation techniques, denoted RMD-MRCD-PCA-LOOCV and RMD-MRCD-PCA-K-FoldCV, are proposed. The results of a simulation study and a real data set indicate that the proposed methods successfully select the appropriate number of PLS components.
Keywords: High leverage points; leave-one-out cross-validation; minimum regularized covariance determinant; partial least squares; principal component analysis
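The classical selection schemes named in the abstract can be illustrated with a minimal sketch. It covers only plain K-fold cross-validation and LOOCV (the case K = n) for a single-response PLS model fitted by NIPALS; it does not implement the RMD-MRCD-PCA weighting function, whose details are not given in this section. All function names (`fit_pls1`, `predict_pls1`, `kfold_rmsecv`, `select_components`) and the simulated data are hypothetical and chosen for illustration.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_pls1(X, y, n_comp):
    """Fit a single-response PLS model by NIPALS with deflation."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centred X
    f = [yi - y_mean for yi in y]                              # centred y
    comps = []
    for _ in range(n_comp):
        w = [dot([E[i][j] for i in range(n)], f) for j in range(m)]
        norm = dot(w, w) ** 0.5
        if norm < 1e-12:  # nothing left to explain
            break
        w = [wj / norm for wj in w]
        t = [dot(row, w) for row in E]  # scores
        tt = dot(t, t)
        if tt < 1e-12:
            break
        p = [dot([E[i][j] for i in range(n)], t) / tt for j in range(m)]
        q = dot(f, t) / tt
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps


def predict_pls1(model, x):
    """Predict one observation by replaying the deflation steps."""
    x_mean, y_mean, comps = model
    e = [xj - mj for xj, mj in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in comps:
        t = dot(e, w)
        y_hat += q * t
        e = [ej - t * pj for ej, pj in zip(e, p)]
    return y_hat


def kfold_rmsecv(X, y, max_comp, k):
    """Return the cross-validated RMSE for 1..max_comp components.

    Setting k = len(X) gives leave-one-out cross-validation.
    """
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]  # interleaved folds
    scores = []
    for a in range(1, max_comp + 1):
        sse = 0.0
        for test in folds:
            held = set(test)
            train = [i for i in range(n) if i not in held]
            model = fit_pls1([X[i] for i in train], [y[i] for i in train], a)
            sse += sum((y[i] - predict_pls1(model, X[i])) ** 2 for i in test)
        scores.append((sse / n) ** 0.5)
    return scores


def select_components(scores):
    """Pick the number of components with the smallest RMSECV."""
    return min(range(len(scores)), key=scores.__getitem__) + 1


if __name__ == "__main__":
    random.seed(0)  # simulated data, for illustration only
    X = [[random.gauss(0, 1) for _ in range(5)] for _ in range(40)]
    y = [r[0] - 2 * r[1] + random.gauss(0, 0.3) for r in X]
    scores = kfold_rmsecv(X, y, max_comp=5, k=5)
    print("RMSECV per component:", [round(s, 3) for s in scores])
    print("Selected components:", select_components(scores))
```

Because every observation enters the squared-error sum with equal weight, a few high leverage points can dominate the criterion and shift the selected number of components, which is the weakness the proposed robust variants are meant to address.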
ABSTRAK
Regresi kuasadua kecil separa (PLS) adalah kaedah yang sangat berguna bagi menganalisis data berdimensi tinggi (HDD). Pemilihan bilangan komponen PLS yang ideal adalah langkah penting bagi membangunkan model terbaik. Ketepatan model akan dipengaruhi sekiranya terlalu banyak atau terlalu sedikit komponen PLS yang dipilih. Pelbagai kaedah klasik seperti pengesahan silang leave-one-out (LOOCV) dan pengesahan silang lipatan K (K-FoldCV) dibangunkan untuk menentukan bilangan komponen PLS yang optimum. Namun begitu, mereka mudah dipengaruhi oleh titik tuasan tinggi (HLPs). Oleh itu, teknik pengesahan silang teguh yang ditandakan dengan RMD-MRCD-PCA-LOOCV dan RMD-MRCD-PCA-K-FoldCV dicadangkan bagi menyelesaikan masalah ini. Keputusan kajian simulasi dan set data sebenar menunjukkan kaedah yang dicadangkan berjaya memilih bilangan komponen PLS yang sesuai.
Kata kunci: Analisis komponen utama; kuasadua terkecil separa; penentu kovarian teratur minimum; pengesahan silang leave-one-out; titik tuasan tinggi
*Corresponding author; email: sitizahariah@unikl.edu.my